What is a Vector Database?
Original article: What is a Vector Database? | Pinecone
We’re in the midst of the AI revolution. It’s upending any industry it touches, promising great innovations - but it also introduces new challenges. Efficient data processing has become more crucial than ever for applications that involve large language models, generative AI, and semantic search.
All of these new applications rely on vector embeddings, a type of data representation that carries semantic information. That information is critical for an AI system to gain understanding and to maintain a long-term memory it can draw upon when executing complex tasks.
Embeddings are generated by AI models (such as Large Language Models) and have a large number of attributes or features, making their representation challenging to manage. In the context of AI and machine learning, these features represent different dimensions of the data that are essential for understanding patterns, relationships, and underlying structures.
That is why we need a specialized database designed specifically for handling this type of data. Vector databases like Pinecone fulfill this requirement by offering optimized storage and querying capabilities for embeddings. They combine the capabilities of a traditional database, which standalone vector indexes lack, with specialized handling of vector embeddings, which traditional scalar-based databases lack.
The challenge of working with vector embeddings is that traditional scalar-based databases can’t keep up with the complexity and scale of such data, making it difficult to extract insights and perform real-time analysis. That’s where vector databases come into play – they are intentionally designed to handle this type of data and offer the performance, scalability, and flexibility you need to make the most out of your data.
With a vector database, we can add advanced features to our AIs, like semantic information retrieval, long-term memory, and more. The diagram below gives us a better understanding of the role of vector databases in this type of application:
Let’s break this down:
- First, we use the embedding model to create vector embeddings for the content we want to index.
- The vector embedding is inserted into the vector database, with some reference to the original content the embedding was created from.
- When the application issues a query, we use the same embedding model to create embeddings for the query, and use those embeddings to query the database for similar vector embeddings. As mentioned before, those similar embeddings are associated with the original content that was used to create them.
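The three steps above can be sketched with a toy in-memory store. This is only an illustration under loud assumptions: `embed` is a hypothetical stand-in that counts words (a real system would call an embedding model), and `ToyVectorStore` is an invented class that scans every record instead of using an index.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Invented for illustration: stores (embedding, original content) pairs."""

    def __init__(self):
        self.records = []

    def upsert(self, content):
        # Steps 1 and 2: embed the content, keep a reference to the original.
        self.records.append((embed(content), content))

    def query(self, text, top_k=1):
        # Step 3: embed the query with the same model, rank by similarity.
        q = embed(text)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [content for _, content in ranked[:top_k]]

store = ToyVectorStore()
store.upsert("vector databases store embeddings")
store.upsert("cats sleep most of the day")
store.upsert("bread is baked in an oven")
result = store.query("how do databases store vector embeddings")
print(result)
```

The query returns the original content rather than the embedding, which is why each stored vector keeps a reference to the text it was created from.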
What’s the difference between a vector index and a vector database?
Standalone vector indices like FAISS (Facebook AI Similarity Search) can significantly improve search and retrieval of vector embeddings, but they lack capabilities that exist in any database. Vector databases, on the other hand, are purpose-built to manage vector embeddings, providing several advantages over using standalone vector indices:
- Data management: Vector databases offer well-known and easy-to-use features for data storage, like inserting, deleting, and updating data. This makes managing and maintaining vector data easier than using a standalone vector index like FAISS, which requires additional work to integrate with a storage solution.
- Metadata storage and filtering: Vector databases can store metadata associated with each vector entry. Users can then query the database using additional metadata filters for finer-grained queries.
- Scalability: Vector databases are designed to scale with growing data volumes and user demands, providing better support for distributed and parallel processing. Standalone vector indices may require custom solutions to achieve similar levels of scalability (such as deploying and managing them on Kubernetes clusters or other similar systems).
- Real-time updates: Vector databases often support real-time data updates, allowing for dynamic changes to the data, whereas standalone vector indices may require a full re-indexing process to incorporate new data, which can be time-consuming and computationally expensive.
- Backups and collections: Vector databases handle the routine operation of backing up all the data stored in the database. Pinecone also allows users to selectively choose specific indexes that can be backed up in the form of “collections,” which store the data in that index for later use.
- Ecosystem integration: Vector databases can more easily integrate with other components of a data processing ecosystem, such as ETL pipelines (like Spark), analytics tools (like Tableau and Segment), and visualization platforms (like Grafana), streamlining the data management workflow. They also enable easy integration with other AI-related tools like LangChain, LlamaIndex, and ChatGPT’s Plugins.
- Data security and access control: Vector databases typically offer built-in data security features and access control mechanisms to protect sensitive information, which may not be available in standalone vector index solutions.
In short, a vector database provides a superior solution for handling vector embeddings by addressing the limitations of standalone vector indices, such as scalability challenges, cumbersome integration processes, and the absence of real-time updates and built-in security measures, ensuring a more effective and streamlined data management experience.
How does a vector database work?
We all know how traditional databases work (more or less)—they store strings, numbers, and other types of scalar data in rows and columns. On the other hand, a vector database operates on vectors, so the way it’s optimized and queried is quite different.
In traditional databases, we usually query for rows whose values exactly match our query. In vector databases, we apply a similarity metric to find the vectors that are most similar to our query.
A vector database uses a combination of different algorithms that all participate in Approximate Nearest Neighbor (ANN) search. These algorithms optimize the search through hashing, quantization, or graph-based search.
These algorithms are assembled into a pipeline that provides fast and accurate retrieval of the neighbors of a queried vector. Since the vector database provides approximate results, the main trade-offs we consider are between accuracy and speed. The more accurate the result, the slower the query will be. However, a good system can provide ultra-fast search with near-perfect accuracy.
Here’s a common pipeline for a vector database:
- Indexing: The vector database indexes vectors using an algorithm such as PQ, LSH, or HNSW (more on these below). This step maps the vectors to a data structure that will enable faster searching.
- Querying: The vector database compares the query vector to the indexed vectors in the dataset to find the nearest neighbors (applying the similarity metric used by that index).
- Post Processing: In some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them to return the final results. This step can include re-ranking the nearest neighbors using a different similarity measure.
In the following sections, we will discuss each of these algorithms in more detail and explain how they contribute to the overall performance of a vector database.
Algorithms
Several algorithms can facilitate the creation of a vector index. Their common goal is to enable fast querying by creating a data structure that can be traversed quickly. They will commonly transform the representation of the original vector into a compressed form to optimize the query process.
However, as a user of Pinecone, you don’t need to worry about the intricacies and selection of these various algorithms. Pinecone is designed to handle all the complexities and algorithmic decisions behind the scenes, ensuring you get the best performance and results without any hassle. By leveraging Pinecone’s expertise, you can focus on what truly matters – extracting valuable insights and delivering powerful AI solutions.
The following sections will explore several algorithms and their unique approaches to handling vector embeddings. This knowledge will empower you to make informed decisions and appreciate the seamless performance Pinecone delivers as you unlock the full potential of your application.
Random Projection
The basic idea behind random projection is to project the high-dimensional vectors to a lower-dimensional space using a random projection matrix. We create a matrix of random numbers whose shape is determined by the original dimension and the target lower dimension we want. We then calculate the dot product of the input vectors and the matrix, which results in a projected matrix that has fewer dimensions than our original vectors but still approximately preserves their similarity.
When we query, we use the same projection matrix to project the query vector onto the lower-dimensional space. Then, we compare the projected query vector to the projected vectors in the database to find the nearest neighbors. Since the dimensionality of the data is reduced, the search process is significantly faster than searching the entire high-dimensional space.
Just keep in mind that random projection is an approximate method, and the projection quality depends on the properties of the projection matrix. In general, the more random the projection matrix is, the better the quality of the projection will be. But generating a truly random projection matrix can be computationally expensive, especially for large datasets. Learn more about random projection.
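A minimal sketch of random projection, assuming a Gaussian projection matrix scaled by 1/sqrt(target dimension), which is one common choice and not necessarily what any particular database uses. The helper names (`make_projection`, `project`) and the dimensions (256 down to 32) are made up for the example.

```python
import math
import random

def make_projection(in_dim, out_dim, seed=0):
    """Build an in_dim x out_dim matrix of scaled Gaussian random numbers."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)]
            for _ in range(in_dim)]

def project(vec, matrix):
    """Multiply a vector by the projection matrix (vector-matrix product)."""
    out_dim = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec)))
            for j in range(out_dim)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(256)] for _ in range(20)]
matrix = make_projection(256, 32)

# Index time: project every stored vector once.
projected = [project(v, matrix) for v in vectors]

# Distances in the reduced space roughly track the original distances.
original = distance(vectors[0], vectors[1])
reduced = distance(projected[0], projected[1])
ratio = reduced / original
print(f"original={original:.2f} reduced={reduced:.2f} ratio={ratio:.2f}")
```

At query time the same `matrix` is applied to the query vector, and the comparison happens between 32-dimensional vectors instead of 256-dimensional ones.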
Product Quantization
Another way to build an index is product quantization (PQ), which is a lossy compression technique for high-dimensional vectors (like vector embeddings). It takes the original vector, breaks it up into smaller chunks, simplifies the representation of each chunk by creating a representative “code” for each chunk, and then puts all the chunks back together - without losing information that is vital for similarity operations. The process of PQ can be broken down into four steps: splitting, training, encoding, and querying.
- Splitting - The vectors are broken into segments.
- Training - We build a “codebook” for each segment. Simply put, the algorithm generates a pool of potential “codes” that could be assigned to a vector. In practice, this “codebook” is made up of the center points of clusters created by performing k-means clustering on each of the vector’s segments. We would have the same number of values in the segment codebook as the value we use for the k-means clustering.
- Encoding - The algorithm assigns a specific code to each segment. In practice, we find the nearest value in the codebook to each vector segment after the training is complete. Our PQ code for the segment will be the identifier for the corresponding value in the codebook. We could use as many PQ codes as we’d like, meaning we can pick multiple values from the codebook to represent each segment.
- Querying - When we query, the algorithm breaks the query vector down into sub-vectors and quantizes them using the same codebook. Then, it uses the indexed codes to find the nearest vectors to the query vector.
The number of representative vectors in the codebook is a trade-off between the accuracy of the representation and the computational cost of searching the codebook. The more representative vectors in the codebook, the more accurate the representation of the vectors in the subspace, but the higher the computational cost to search the codebook. By contrast, the fewer representative vectors in the codebook, the less accurate the representation, but the lower the computational cost. Learn more about PQ.
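The four PQ steps can be sketched in plain Python. This is an illustration, not a production implementation: the k-means here is a bare-bones version written for the example, all function names are invented, and the query step compares the raw query segments against the stored codes’ centroids (often called asymmetric distance) rather than quantizing the query first.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Bare-bones k-means returning k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, n_segments):
    """Splitting: cut a vector into equally sized segments."""
    size = len(vec) // n_segments
    return [vec[i * size:(i + 1) * size] for i in range(n_segments)]

def train(vectors, n_segments, k):
    """Training: one codebook (k centroids) per segment position."""
    return [kmeans([split(v, n_segments)[s] for v in vectors], k, seed=s)
            for s in range(n_segments)]

def encode(vec, codebooks):
    """Encoding: replace each segment by the id of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(seg, cb[c]))
            for seg, cb in zip(split(vec, len(codebooks)), codebooks)]

def approx_sq_dist(query, code, codebooks):
    """Querying: distance from the raw query to a stored code's centroids."""
    segments = split(query, len(codebooks))
    return sum(sq_dist(seg, cb[c]) for seg, cb, c in zip(segments, codebooks, code))

rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train(vectors, n_segments=4, k=16)
codes = [encode(v, codebooks) for v in vectors]  # 4 small integers per vector

query = vectors[0]
best = min(range(len(codes)), key=lambda i: approx_sq_dist(query, codes[i], codebooks))
print("code for vector 0:", codes[0], "approximate nearest id:", best)
```

Each 8-dimensional vector is stored as four integers in the range 0 to 15, which is where the compression comes from; raising `k` makes the codebook more precise and more expensive to search.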
Locality-sensitive hashing
Locality-Sensitive Hashing (LSH) is a technique for indexing in the context of an approximate nearest-neighbor search. It is optimized for speed while still delivering an approximate, non-exhaustive result. LSH maps similar vectors into “buckets” using a set of hashing functions, as seen below:
To find the nearest neighbors for a given query vector, we use the same hashing functions used to “bucket” similar vectors into hash tables. The query vector is hashed to a particular table and then compared with the other vectors in that same table to find the closest matches. This method is much faster than searching through the entire dataset because there are far fewer vectors in each hash table than in the whole space.
It’s important to remember that LSH is an approximate method, and the quality of the approximation depends on the properties of the hash functions. In general, the more hash functions used, the better the approximation quality will be. However, using a large number of hash functions can be computationally expensive and may not be feasible for large datasets. Learn more about LSH.
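One well-known family of LSH functions uses random hyperplanes: each hyperplane contributes one bit depending on which side of it the vector falls. The sketch below assumes that scheme and a single hash table; the function names and sizes are illustrative only.

```python
import random
from collections import defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def make_hyperplanes(dim, n_bits, seed=0):
    """One random hyperplane (normal vector) per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def hash_vector(vec, hyperplanes):
    """Bit i is 1 when the vector lies on the non-negative side of hyperplane i."""
    return tuple(1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
                 for plane in hyperplanes)

def build_index(vectors, hyperplanes):
    """Group vector ids into buckets keyed by their hash."""
    buckets = defaultdict(list)
    for i, vec in enumerate(vectors):
        buckets[hash_vector(vec, hyperplanes)].append(i)
    return buckets

def search(query, vectors, buckets, hyperplanes):
    """Compare the query only against vectors in its own bucket."""
    candidates = buckets.get(hash_vector(query, hyperplanes), [])
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))

rng = random.Random(3)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(500)]
hyperplanes = make_hyperplanes(dim=16, n_bits=6)
buckets = build_index(vectors, hyperplanes)

results = search(vectors[3], vectors, buckets, hyperplanes)
print(f"{len(buckets)} buckets, {len(results)} candidates scanned instead of 500")
```

Because only one bucket is scanned, a true neighbor that landed in a different bucket is missed; this is the approximation the section describes, and using several hash tables is the usual way to reduce it.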
Hierarchical Navigable Small World (HNSW)
HNSW creates a hierarchical, multi-layer graph in which each node represents a vector and the edges between nodes connect vectors that are similar to each other. The upper layers are sparse and contain only a subset of the vectors, which provides long-range links, while the bottom layer contains every vector.

As vectors are inserted, the algorithm examines the existing nodes and draws edges between the new node and the nodes whose vectors are most similar to it.

When we query an HNSW index, it uses this graph to navigate from the top layer down, at each step moving to the neighboring nodes that are closest to the query vector. Learn more about HNSW.
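A full HNSW implementation is too long for a short example, so the sketch below shows only the core navigation idea on a single layer: a proximity graph plus greedy search. It is a deliberate simplification, not HNSW itself; the real algorithm adds multiple layers, incremental insertion, and a wider candidate list during search. All names are invented for the example.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, m):
    """Link every vector to its m nearest neighbors (brute force, for clarity)."""
    graph = {}
    for i, v in enumerate(vectors):
        others = sorted((j for j in range(len(vectors)) if j != i),
                        key=lambda j: dist(v, vectors[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(query, vectors, graph, entry):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    current_dist = dist(query, vectors[current])
    while True:
        best = min(graph[current], key=lambda j: dist(query, vectors[j]))
        best_dist = dist(query, vectors[best])
        if best_dist >= current_dist:
            # No neighbor is closer: a local minimum, returned as the answer.
            return current, current_dist
        current, current_dist = best, best_dist

rng = random.Random(7)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
graph = build_graph(vectors, m=8)

query = [rng.random() for _ in range(8)]
found, found_dist = greedy_search(query, vectors, graph, entry=0)
print(f"stopped at node {found}, distance {found_dist:.3f}")
```

The search can stop at a local minimum that is not the true nearest neighbor, which is one reason the hierarchical layers and wider candidate lists exist in the real algorithm.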
Similarity Measures
Building on the previously discussed algorithms, we need to understand the role of similarity measures in vector databases. These measures are the foundation of how a vector database compares and identifies the most relevant results for a given query.
Similarity measures are mathematical methods for determining how similar two vectors are in a vector space. Similarity measures are used in vector databases to compare the vectors stored in the database and find the ones that are most similar to a given query vector.
Several similarity measures can be used, including:
- Cosine similarity: measures the cosine of the angle between two vectors in a vector space. It ranges from -1 to 1, where 1 represents vectors pointing in the same direction, 0 represents orthogonal vectors, and -1 represents vectors that are diametrically opposed.
- Euclidean distance: measures the straight-line distance between two vectors in a vector space. It ranges from 0 to infinity, where 0 represents identical vectors, and larger values represent increasingly dissimilar vectors.
- Dot product: measures the product of the magnitudes of two vectors and the cosine of the angle between them. It ranges from -∞ to ∞, where a positive value represents vectors that point in broadly the same direction, 0 represents orthogonal vectors, and a negative value represents vectors that point in broadly opposite directions.
The choice of similarity measure will have an effect on the results obtained from a vector database. It is also important to note that each similarity measure has its own advantages and disadvantages, and it is important to choose the right one depending on the use case and requirements. Learn more about similarity measures.
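The three measures are short enough to write out directly. The sketch below uses plain Python lists and hand-picked 2-D vectors so the results are easy to check by eye.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot_product(a, a))

def cosine_similarity(a, b):
    """Dot product divided by both magnitudes: depends on direction only."""
    return dot_product(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

right = [1.0, 0.0]
up = [0.0, 1.0]
left = [-1.0, 0.0]
long_diagonal = [3.0, 4.0]

print(cosine_similarity(right, up))       # orthogonal vectors
print(cosine_similarity(right, left))     # opposite directions
print(cosine_similarity(long_diagonal, [6.0, 8.0]))  # same direction, different length
print(euclidean_distance([0.0, 0.0], long_diagonal))
print(dot_product(right, long_diagonal))
```

Note that `[3, 4]` and `[6, 8]` have a cosine similarity of 1 even though their Euclidean distance is not 0: cosine ignores magnitude, while Euclidean distance and dot product do not.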
Filtering
Every vector stored in the database can also carry metadata. In addition to the ability to query for similar vectors, vector databases can filter the results based on a metadata query. To do this, the vector database usually maintains two indexes: a vector index and a metadata index.
The filtering process can be performed either before or after the vector search itself, but each approach has its own challenges that may impact the query performance:
- Pre-filtering: In this approach, metadata filtering is done before the vector search. While this can help reduce the search space, it may also cause the system to overlook relevant results that don’t match the metadata filter criteria. Additionally, extensive metadata filtering may slow down the query process due to the added computational overhead.
- Post-filtering: In this approach, the metadata filtering is done after the vector search. This can help ensure that all relevant results are considered, but it may also introduce additional overhead and slow down the query process as irrelevant results need to be filtered out after the search is complete.
To optimize the filtering process, vector databases use various techniques, such as leveraging advanced indexing methods for metadata or using parallel processing to speed up the filtering tasks. Balancing the trade-offs between search performance and filtering accuracy is essential for providing efficient and relevant query results in vector databases. Learn more about vector search filtering.
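The difference between the two orderings is easiest to see on a tiny dataset. The sketch below uses brute-force search and an invented `lang` metadata field; it illustrates one concrete effect, namely that post-filtering can return fewer than `top_k` results when the nearest vectors fail the filter.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k_nearest(records, query, top_k):
    return sorted(records, key=lambda r: sq_dist(query, r["vec"]))[:top_k]

def pre_filter_search(records, query, top_k, predicate):
    """Filter by metadata first, then search only the surviving records."""
    allowed = [r for r in records if predicate(r)]
    return [r["id"] for r in top_k_nearest(allowed, query, top_k)]

def post_filter_search(records, query, top_k, predicate):
    """Search everything first, then drop results that fail the filter."""
    nearest = top_k_nearest(records, query, top_k)
    return [r["id"] for r in nearest if predicate(r)]

records = [
    {"id": "a", "vec": [0.0, 0.0], "lang": "en"},
    {"id": "b", "vec": [0.1, 0.0], "lang": "fr"},
    {"id": "c", "vec": [0.2, 0.0], "lang": "fr"},
    {"id": "d", "vec": [5.0, 5.0], "lang": "en"},
]

def is_english(record):
    return record["lang"] == "en"

query = [0.0, 0.0]
pre = pre_filter_search(records, query, top_k=2, predicate=is_english)
post = post_filter_search(records, query, top_k=2, predicate=is_english)
print("pre-filter: ", pre)
print("post-filter:", post)
```

Pre-filtering returns two English records, including the distant `d`, while post-filtering keeps only `a` because `b` occupied the second slot before the filter ran.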
Database Operations
Unlike vector indexes, vector databases are equipped with a set of capabilities that makes them better suited to high-scale production settings. Let’s take a look at an overview of the components that are involved in operating the database.
Performance and Fault Tolerance
Performance and fault tolerance are tightly related. The more data we have, the more nodes are required, and the greater the chance of errors and failures. As is the case with other types of databases, we want to ensure that queries are executed as quickly as possible even if some of the underlying nodes fail. Such failures could be due to hardware faults, network problems, or other technical bugs, and they could result in downtime or even incorrect query results.
To ensure both high performance and fault tolerance, vector databases use sharding and replication:
- Sharding - partitioning the data across multiple nodes. There are different methods for partitioning the data. For example, it can be partitioned by the similarity of different clusters of data so that similar vectors are stored in the same partition. When a query is made, it is sent to all the shards and the results are retrieved and combined. This is called the “scatter-gather” pattern.
- Replication - creating multiple copies of the data across different nodes. This ensures that even if a particular node fails, other nodes will be able to replace it. There are two main consistency models: eventual consistency and strong consistency. Eventual consistency allows for temporary inconsistencies between different copies of the data, which improves availability and reduces latency but may result in conflicts and even data loss. On the other hand, strong consistency requires that all copies of the data are updated before a write operation is considered complete. This approach provides stronger consistency but may result in higher latency.
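The scatter-gather pattern can be sketched in a few lines. In this illustration the “shards” are just Python lists in one process and vectors are assigned by id modulo the shard count, both of which are assumptions made for the example; a real system would send the query to separate nodes over the network.

```python
import heapq
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def shard_search(shard, query, top_k):
    """Each shard returns its own local top-k as (distance, id) pairs."""
    return heapq.nsmallest(top_k, ((sq_dist(query, vec), vid) for vid, vec in shard))

def scatter_gather(shards, query, top_k):
    """Scatter the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(shard_search(shard, query, top_k))
    return [vid for _, vid in heapq.nsmallest(top_k, partial)]

rng = random.Random(11)
data = [(vid, [rng.random() for _ in range(4)]) for vid in range(100)]

# Partition the data across 4 shards by id.
shards = [[item for item in data if item[0] % 4 == s] for s in range(4)]

query = [0.5, 0.5, 0.5, 0.5]
merged = scatter_gather(shards, query, top_k=5)

# Reference answer: search all data as if it lived on a single node.
single_node = [vid for _, vid in heapq.nsmallest(5, ((sq_dist(query, v), i) for i, v in data))]
print(merged, single_node)
```

Asking each shard for its full top-k is what makes the merged result identical to a single-node search: any vector in the global top-k is necessarily in the top-k of its own shard.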
Monitoring
To effectively manage and maintain a vector database, we need a robust monitoring system that tracks the important aspects of the database’s performance, health, and overall status. Monitoring is critical for detecting potential problems, optimizing performance, and ensuring smooth production operations. Some aspects of monitoring a vector database include the following:
- Resource usage - monitoring resource usage, such as CPU, memory, disk space, and network activity, enables the identification of potential issues or resource constraints that could affect the performance of the database.
- Query performance - query latency, throughput, and error rates may indicate potential systemic issues that need to be addressed.
- System health - overall system health monitoring includes the status of individual nodes, the replication process, and other critical components.
Access Control
Access control is the process of managing and regulating user access to data and resources. It is a vital component of data security, ensuring that only authorized users have the ability to view, modify, or interact with sensitive data stored within the vector database.
Access control is important for several reasons:
- Data protection: As AI applications often deal with sensitive and confidential information, implementing strict access control mechanisms helps safeguard data from unauthorized access and potential breaches.
- Compliance: Many industries, such as healthcare and finance, are subject to strict data privacy regulations. Implementing proper access control helps organizations comply with these regulations, protecting them from legal and financial repercussions.
- Accountability and auditing: Access control mechanisms enable organizations to maintain a record of user activities within the vector database. This information is crucial for auditing purposes, and when security breaches happen, it helps trace back any unauthorized access or modifications.
- Scalability and flexibility: As organizations grow and evolve, their access control needs may change. A robust access control system allows for seamless modification and expansion of user permissions, ensuring that data security remains intact throughout the organization’s growth.
Backups and Collections
When all else fails, vector databases offer the ability to rely on regularly created backups. These backups can be stored on external storage systems or cloud-based storage services, ensuring the safety and recoverability of the data. In case of data loss or corruption, these backups can be used to restore the database to a previous state, minimizing downtime and impact on the overall system. With Pinecone, users can choose to back up specific indexes as well and save them as “collections,” which can later be used to populate new indexes.
API and SDKs
This is where the rubber meets the road: Developers who interact with the database want to do so with an easy-to-use API, using a toolset that is familiar and comfortable. By providing a user-friendly interface, the vector database API layer simplifies the development of high-performance vector search applications.
In addition to the API, vector databases often provide language-specific SDKs that wrap the API. The SDKs make it even easier for developers to interact with the database in their applications. This allows developers to concentrate on their specific use cases, such as semantic text search, generative question-answering, hybrid search, image similarity search, or product recommendations, without having to worry about the underlying infrastructure complexities.
Summary
The exponential growth of vector embeddings in fields such as NLP, computer vision, and other AI applications has resulted in the emergence of vector databases as the computation engine that allows us to interact effectively with vector embeddings in our applications.
Vector databases are purpose-built databases that are specialized to tackle the problems that arise when managing vector embeddings in production scenarios. For that reason, they offer significant advantages over traditional scalar-based databases and standalone vector indexes.
In this post, we reviewed the key aspects of a vector database, including how it works, what algorithms it uses, and the additional features that make it operationally ready for production scenarios. We hope this helps you understand the inner workings of vector databases. Luckily, this isn’t something you must know to use Pinecone. Pinecone takes care of all of these considerations (and then some) and frees you to focus on the rest of your application.
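Author of this post: Maeiee
Link to this post: What is a Vector Database?
Copyright notice: Unless otherwise stated, this post is an original work and the copyright belongs to Maeiee. Do not reproduce without permission.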